the waveforms are perfectly periodic, and are stored as one of
their periods;

synthesis is obtained by overlap-adding the waveforms obtained
from time-domain multiplication of the periodic waveforms with a
weighting window whose size is approximately two times the period
of the signals to weight, and whose relative position inside the
period is fixed to any value identical for all the periods;

whereby the time shift between two successive waveforms obtained
by weighting the original signals is set to the imposed
fundamental period of the signal to synthesize.
Description

The invention described herein relates to a method of synthesis
of audio sounds. To simplify the description, the focus is mainly
on vocal sounds, keeping in mind that the invention can be applied
to the field of music synthesis as well.

Background of the invention 

In the framework of the so-called "concatenative" synthesis
techniques, which are increasingly used, synthetic speech is
produced from a database of speech segments. Segments may be
diphones, for example, which begin in the middle of the
stationary part of a phone (the phone being the acoustic
realization of a phoneme) and end in the middle of the stationary
part of the next phone. French, for instance, is composed of 36
phonemes, which corresponds to approximately 1240 diphones (as a
matter of fact, some combinations of phonemes are impossible).
Other types of segments can be used, like triphones, polyphones,
half-syllables, etc. Concatenative synthesis techniques produce
any sequence of phonemes by concatenating the appropriate
segments. The segments are themselves obtained from the
segmentation of a speech corpus read by a human speaker.

Two problems must be solved during the concate- 
nation process in order to get a speech signal compa- 
rable to human speech. 

The first problem arises from the disparities of the 
phonemic contexts from which the segments were ex- 
tracted, which generally results in some spectral enve- 
lope mismatch at both ends of the segments to be con- 
catenated. As a result, a mere concatenation of seg- 
ments leads to sharp transitions between units, and to 
less fluid speech. 

The second problem is to control the prosody of synthetic speech,
i.e. its rhythm (phoneme and pause lengths) and its fundamental
frequency (the vibration frequency of the vocal folds). The point
is that the segments recorded in the corpus have their own
prosody, which does not necessarily correspond to the prosody
imposed at synthesis time.

Hence there is a need to find a means of controlling 
prosodic parameters and of producing smooth transi- 
tions between segments, without affecting the natural- 
ness of speech segments. 

Two families of methods can be distinguished for solving such
problems: those that implement a spectral model of the vocal
tract, and those that modify the segment waveforms directly in
the time domain.

In the first category of methods, transitions between
concatenated segments are smoothed out by computing the
difference between the spectral envelopes on both sides of the
concatenation point, and propagating this difference in the
spectral domain on both segments. The way these methods control
the pitch and the duration of segments depends on the particular
model used for spectral envelope estimation. All these methods
require a high computational load at synthesis time, which
prevents them from being implemented in real time on low-cost
processors.

On the contrary, the second family of synthesis methods aims to
produce concatenation and prosody modification directly in the
time domain with a very limited computational load. All of them
take advantage of the so-called "Poisson's Sum Theorem", well
known among signal processing specialists, which demonstrates
that it is possible to build from any finite waveform with a
given spectral envelope an infinite waveform with the same
spectral envelope for an arbitrarily chosen (and constant) pitch.
This theorem can be applied to the modification of the
fundamental frequency of speech signals. Provided the spectrum of
the elementary waveforms is close to the spectral envelope of the
signal one wishes to modify, pitch can be imposed by setting the
shift between elementary waveforms to the targeted pitch period,
and by adding the resulting overlapping waveforms. In this second
family, synthesis methods mainly differ in the way they derive
elementary waveforms from the pre-recorded segments. However, in
order to produce high-quality synthetic speech, the overlapping
elementary waveforms they use must have a duration of at least
twice the fundamental period of the original segments. Two
classes of techniques in this second family of synthesis methods
will be described hereafter.

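
The overlap-add principle stated above can be checked numerically. The following Python sketch periodizes a finite waveform at an arbitrary period and compares the harmonics of the result with the spectrum of the finite waveform sampled at multiples of the imposed fundamental. The test waveform (a decaying resonance) and all function names are assumptions made for this illustration; they are not taken from any of the cited methods.

```python
import cmath
import math


def periodize(waveform, period):
    """Return one period of the infinite overlap-add of `waveform`,
    copies of which are shifted by every multiple of `period` samples."""
    one_period = [0.0] * period
    for n, sample in enumerate(waveform):
        one_period[n % period] += sample
    return one_period


def spectrum_at(signal, cycles_per_sample):
    """Fourier transform of a finite signal at one normalized frequency."""
    return sum(
        sample * cmath.exp(-2j * math.pi * cycles_per_sample * n)
        for n, sample in enumerate(signal)
    )


def harmonic_mismatch(waveform, period):
    """Largest gap between the harmonics of the periodized signal and the
    spectrum of the finite waveform sampled at multiples of 1/period."""
    one_period = periodize(waveform, period)
    return max(
        abs(spectrum_at(one_period, k / period) - spectrum_at(waveform, k / period))
        for k in range(period)
    )


# Hypothetical elementary waveform: a 64-sample decaying resonance.
WAVEFORM = [
    math.exp(-n / 12.0) * math.sin(2 * math.pi * 0.11 * n) for n in range(64)
]
```

Calling `harmonic_mismatch(WAVEFORM, 20)` or `harmonic_mismatch(WAVEFORM, 35)` returns a value at the level of floating-point rounding: whatever period is imposed, the harmonics of the periodized signal lie on the spectrum of the elementary waveform.
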
The first class refers to methods hereafter referred to as
'PSOLA' methods (Pitch Synchronous Overlap Add), characterized by
the direct extraction of waveforms from continuous audio signals.
The audio signals used are either identical to the original
signals (the segments), or obtained after some transformation of
these original signals. Elementary waveforms are extracted from
the audio signals by multiplying the signals with finite-duration
weighting windows positioned synchronously with the fundamental
frequency of the original signal. Since the size of the
elementary waveforms must be at least twice the original period,
and given that there is one waveform for each period of the
original signal, the same speech samples are used in several
successive waveforms: the weighting windows overlap in the audio
signals.
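
A minimal Python sketch of this extraction is given below, assuming a perfectly periodic test signal, a Hann weighting window of exactly two periods, and window positions at multiples of the period. These choices and the function names are assumptions for illustration only; practical PSOLA systems place the windows on pitch marks estimated from real speech.

```python
import math


def hann(length):
    """Periodic Hann window: two copies shifted by length/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / length) for n in range(length)]


def extract_waveforms(signal, period):
    """Cut one two-period weighted waveform per original period.
    Successive windows overlap, so interior samples are used twice."""
    window = hann(2 * period)
    starts = range(0, len(signal) - 2 * period + 1, period)
    return [[signal[s + n] * window[n] for n in range(2 * period)] for s in starts]


def overlap_add(waveforms, shift):
    """Sum the waveforms after shifting each one by `shift` samples."""
    out = [0.0] * (shift * (len(waveforms) - 1) + len(waveforms[0]))
    for i, waveform in enumerate(waveforms):
        for n, sample in enumerate(waveform):
            out[i * shift + n] += sample
    return out


ORIGINAL_PERIOD = 20  # samples, assumed
SIGNAL = [
    math.sin(2 * math.pi * n / ORIGINAL_PERIOD)
    + 0.3 * math.sin(4 * math.pi * n / ORIGINAL_PERIOD + 0.5)
    for n in range(200)
]
```

With a shift equal to the original period the interior of the signal is reconstructed unchanged, while `overlap_add(extract_waveforms(SIGNAL, 20), 16)` yields a signal whose interior repeats every 16 samples, i.e. with a raised pitch.
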

Examples of such PSOLA methods are those defined in documents
EP-0363233, US-5479564 and EP-0706170. A specific example is also
the MBR-PSOLA method as published by T. Dutoit and H. Leich in
Speech Communication, Elsevier Publisher, November 1993, Vol. 13,
N° 3-4. The method described in document US-5479564 suggests a
means of modifying the frequency of an audio signal with constant
fundamental frequency by overlap-adding short-term signals
extracted from this signal. The length of the weighting windows
used to obtain the short-term signals is approximately equal to
two times the period of the audio signal, and their position
within the period can be set to any value (provided the time
shift between successive windows is equal to the period of the
audio signal). Document US-5479564 also describes a means of
interpolating waveforms between segments to concatenate, so as to
smooth out discontinuities. This is achieved by modifying the
periods corresponding to the end of the first segment and to the
beginning of the second segment, in such a way as to propagate
the difference between the last period of the first segment and
the first period of the second segment.

The second class of techniques, hereafter referred to as
'analytic', is based on a time-domain modification of waveforms
that do not share, even partially, their samples. The synthesis
step still uses shifting and overlap-adding of the weighted
waveforms carrying the spectral envelope information. These
waveforms are no longer extracted from a continuous speech signal
by means of overlapping weighting windows. Examples of these
techniques are those defined in documents US-5369730 and
GB-2261350, as well as by T. Yazu and K. Yamada, "The speech
synthesis system for an unlimited Japanese vocabulary", in
Proceedings IEEE ICASSP 1986, Tokyo, pp. 2019-2022.

In all these 'analytic' techniques, elementary wave- 
forms are impulse responses of the vocal tract comput- 
ed from evenly spaced speech signal frames, and re- 
synthesized via a spectral model. The present invention 
falls in this class of methods. 

An advantage of analytic methods over PSOLA 
methods is that the waveforms they use result from a 
true spectral model of the vocal tract. Therefore, they 
can intrinsically model the instantaneous spectral enve- 
lope information with more accuracy and precision than 
PSOLA techniques, which simply weight a time-domain 
signal with a weighting window. Moreover, it is possible 
with analytic methods to separate the periodic (voiced) 
and aperiodic (unvoiced) components of each wave- 
form, and modify their balance during the resynthesis 
step in order to modify the speech quality (soft, harsh, 
whispered, etc). 

In practice, this advantage is counterbalanced by an increase in
the size of the resynthesized segment database (typically by a
factor of 2, since the successive waveforms do not share any
samples while their duration still has to be equal to at least
two times the pitch period of the audio signal). The method
described by MM. Yazu and Yamada precisely aims at reducing the
number of samples to be stored, by resynthesizing impulse
responses in which the phases of the spectral envelope are set to
zero. Only half of the waveform needs to be stored in this case,
since phase zeroing results in perfectly symmetrical waveforms.
The main drawback of this method is that it greatly affects the
naturalness of the synthetic speech. It is well known, indeed,
that large phase distortions have a strong effect on speech
quality.
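
The link between phase zeroing and symmetry can be illustrated with a short Python sketch. It builds a waveform as a sum of harmonic cosines, which is an assumption made for simplicity (the cited method computes impulse responses by cepstral analysis and inverse FFT); the amplitudes, phases and helper names are hypothetical.

```python
import math


def sum_of_harmonics(amplitudes, phases, length):
    """One period (`length` samples) of a sum of harmonic cosines."""
    return [
        sum(
            a * math.cos(2 * math.pi * k * n / length + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)
        )
        for n in range(length)
    ]


def store_half(waveform):
    """Keep samples 0..N/2 of an N-sample waveform (N even)."""
    return waveform[: len(waveform) // 2 + 1]


def rebuild_from_half(half):
    """Mirror the stored half, which is exact only for symmetric waveforms."""
    return half + half[-2:0:-1]


AMPLITUDES = [1.0, 0.6, 0.3, 0.1]  # hypothetical harmonic amplitudes
ZERO_PHASE = sum_of_harmonics(AMPLITUDES, [0.0] * 4, 32)
NATURAL = sum_of_harmonics(AMPLITUDES, [0.4, -1.1, 2.0, 0.7], 32)  # hypothetical phases
```

`rebuild_from_half(store_half(ZERO_PHASE))` recovers the zero-phase waveform exactly from 17 of its 32 samples, whereas the same operation applied to `NATURAL` does not: halving the storage in this way is only possible at the price of discarding the natural phases.
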



Aim of the invention 

The present invention aims to suggest a method for audio
synthesis that avoids the drawbacks presented in the state of the
art and which requires limited storage for the waveforms while
avoiding important distortions of the natural phase of acoustic
signals.

Main characteristic elements of the invention 

The present invention relates to a method for audio synthesis
from waveforms stored in a dictionary, characterized by the
following points:

- the waveforms are infinite and perfectly periodic, and are
  stored as one of their periods, itself represented as a
  sequence of sound samples of a priori any length;

- synthesis is carried out by overlapping and adding the
  waveforms multiplied by a weighting window whose length is
  approximately two times the period of the original waveform,
  and whose position relative to the waveform can be set to any
  fixed value.

The time shift between two successive weighted signals obtained
by weighting the original waveforms is equal to the fundamental
period requested for the synthetic signal, whose value is
imposed. This value may be lower or greater than that of the
original waveforms.
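
The points above can be sketched in Python as follows. The sketch assumes a Hann weighting window of exactly two periods and a dictionary made of identical stored periods; the function names, the `offset` parameter (the fixed window position) and the test data are hypothetical and only serve the illustration.

```python
import math


def unfold(period, length, offset=0):
    """Loop one stored period to get `length` samples of the periodic waveform."""
    return [period[(offset + n) % len(period)] for n in range(length)]


def weighted_waveform(period, offset=0):
    """Two periods of the unfolded waveform, weighted by a Hann window."""
    size = 2 * len(period)
    samples = unfold(period, size, offset)
    return [
        samples[n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / size))
        for n in range(size)
    ]


def synthesize(stored_periods, target_period, offset=0):
    """Overlap-add one weighted waveform per stored period, shifting
    successive waveforms by the imposed fundamental period."""
    size = 2 * len(stored_periods[0])
    out = [0.0] * (target_period * (len(stored_periods) - 1) + size)
    for i, period in enumerate(stored_periods):
        for n, sample in enumerate(weighted_waveform(period, offset)):
            out[i * target_period + n] += sample
    return out


# Hypothetical dictionary entry: 8 identical stored periods of 20 samples.
PERIOD = [
    math.sin(2 * math.pi * n / 20) + 0.4 * math.cos(6 * math.pi * n / 20)
    for n in range(20)
]
DICTIONARY = [PERIOD] * 8
```

`synthesize(DICTIONARY, 20)` reproduces the stored period in its interior, while `synthesize(DICTIONARY, 26)` and `synthesize(DICTIONARY, 15)` produce signals whose interiors repeat every 26 and 15 samples respectively, showing an imposed period greater or lower than the stored one.
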

The method according to the present invention basically differs
from any other 'analytic' method by the fact that the elementary
waveforms used are not impulse responses of the vocal tract, but
infinite periodic signals, multiplied by a weighting window to
keep their length finite, and carrying the same spectral envelope
as the original audio signals. A spectral model (a hybrid
harmonic/stochastic model, for instance, although the invention
is not exclusively related to any particular spectral model) is
used for resynthesis in order to get periodic waveforms (instead
of the symmetric impulse responses of MM. Yazu and Yamada)
carrying instantaneous spectral envelope information. Because of
the periodicity of the elementary waveforms produced, only the
first period needs to be stored. The sound quality obtained by
this method is incomparably superior to that of MM. Yazu and
Yamada, since the computation of the periodic waveforms does not
impose phase constraints on the spectral envelopes, thereby
avoiding the related quality degradation.

The periods that need to be stored are obtained by spectral
analysis of a dictionary of audio segments (e.g. diphones in the
case of speech synthesis). Spectral analysis produces spectral
envelope estimates throughout each segment. Harmonic phases and
amplitudes are then computed from the spectral envelope and the
target period (i.e. the spectral envelope is sampled with the
targeted fundamental frequency).
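
Sampling a spectral envelope at the harmonics of a target fundamental can be sketched as below. The sampling rate, the two-peak envelope function and the function names are assumptions chosen for the example; only the amplitude sampling is shown, and the estimation of the envelope itself (and of the harmonic phases) is left out.

```python
import math

SAMPLE_RATE = 16000  # Hz, assumed for the example


def envelope(frequency_hz):
    """Hypothetical smooth spectral envelope with two formant-like peaks."""
    return (
        math.exp(-(((frequency_hz - 700.0) / 300.0) ** 2))
        + 0.5 * math.exp(-(((frequency_hz - 2200.0) / 500.0) ** 2))
    )


def harmonic_amplitudes(period_samples):
    """Sample the envelope at every harmonic of the target fundamental
    that lies strictly below the Nyquist frequency."""
    fundamental = SAMPLE_RATE / period_samples
    count = (period_samples - 1) // 2
    return [envelope(k * fundamental) for k in range(1, count + 1)]
```

For a target period of 160 samples (a 100 Hz fundamental at the assumed rate), `harmonic_amplitudes(160)` returns 79 values, the seventh of which is the envelope value at 700 Hz. Halving the period to 80 samples samples the same envelope at 200 Hz intervals.
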

The length of each resynthesis period can advantageously be
chosen equal for all the periods of all the segments. In this
particular case, classical techniques for waveform compression
(e.g. ADPCM) allow very high compression ratios (about 8) with
very limited computational cost for decoding. The remarkable
efficiency of such techniques on the waveforms obtained mainly
originates from the fact that:

- all the periods stored in the segment database have the same
  length, which leads to a very efficient period-to-period
  differential coding scheme;

- the use of a spectral model for spectral envelope estimation
  allows the separation of harmonic and stochastic components of
  the waveforms. When the energy of the stochastic component is
  low enough compared to that of the harmonic component, it may
  be completely eliminated, in which case only the harmonic
  component is resynthesized. This results in waveforms that are
  more pure, noiseless, and exhibit more regularity than the
  original signal, which additionally enhances the efficiency of
  ADPCM coding techniques.

To further enhance the efficiency of coding techniques, the
phases of the lower-order (i.e., lower-frequency) harmonics of
each stored period may be fixed (one phase value fixed for each
harmonic of the database) for the resynthesis step. The frequency
band where this setting is acceptable ranges from 0 to
approximately 3 kHz. In this case, the resynthesis operation
results in a sequence of periods with constant length, in which
the time-domain difference between two successive periods is
mainly due to spectral envelope differences. Since the spectral
envelope of audio signals generally changes slowly with time,
given the inertia of the physical mechanisms that produce them,
the shape of the periods obtained in this way also evolves
slowly. This, in turn, is particularly efficient when it comes to
coding signals on the basis of period-to-period differences.
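
The benefit of fixed phases for period-to-period differential coding can be illustrated with the following Python sketch. The amplitude sets, the phase values and the function names are hypothetical; the example only compares the energy of the difference between two successive periods with and without a shared phase set, and does not implement an ADPCM coder.

```python
import math


def synthesize_period(amplitudes, phases, length):
    """One period of a sum of harmonic cosines."""
    return [
        sum(
            a * math.cos(2 * math.pi * k * n / length + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)
        )
        for n in range(length)
    ]


def difference_energy(first, second):
    """Energy of the sample-by-sample difference between two periods."""
    return sum((b - a) ** 2 for a, b in zip(first, second))


# Two successive frames of a slowly changing spectral envelope (hypothetical).
AMPS_A = [1.0, 0.7, 0.4, 0.2]
AMPS_B = [1.05, 0.68, 0.43, 0.19]

FIXED = [0.0, 1.2, -0.7, 2.1]    # one phase per harmonic for the whole database
FREE_A = [0.3, -1.0, 2.5, 0.1]   # unconstrained phases, first frame
FREE_B = [1.4, 0.2, -2.0, 1.9]   # unconstrained phases, second frame

fixed_energy = difference_energy(
    synthesize_period(AMPS_A, FIXED, 32), synthesize_period(AMPS_B, FIXED, 32)
)
free_energy = difference_energy(
    synthesize_period(AMPS_A, FREE_A, 32), synthesize_period(AMPS_B, FREE_B, 32)
)
```

With the shared phase set, the difference signal only carries the small amplitude changes, so `fixed_energy` is several orders of magnitude below `free_energy`. This is the regularity that a differential coder can exploit.
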

Independently of its use for segment coding, the 
idea of imposing a set of fixed values for the phases of 
the lower frequency harmonics leads to the implemen- 
tation of a temporal smoothing technique between suc- 
cessive segments, to attenuate spectral mismatch be- 
tween periods. The temporal difference between the last 
period of the first segment and the first period of the sec- 
ond segment is computed, and smoothly propagated on 
both sides of the concatenation point with a weighting 
coefficient continuously varying from -0.5 to 0.5 (de- 
pending on which side of the concatenation point is 
processed). 
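
A minimal Python sketch of this smoothing is given below. It assumes that the weighting coefficient varies linearly with the distance to the concatenation point over a hypothetical number of periods `span`; the text only states that the coefficient varies continuously between -0.5 and 0.5, so the linear schedule and the function name are assumptions.

```python
def smooth_junction(first_segment, second_segment, span):
    """Spread the difference between the last period of the first segment
    and the first period of the second segment over `span` periods on each
    side of the concatenation point.

    Each segment is a list of periods; each period is a list of samples.
    """
    last, first = first_segment[-1], second_segment[0]
    difference = [b - a for a, b in zip(last, first)]
    left = [list(period) for period in first_segment]
    right = [list(period) for period in second_segment]
    for j in range(span):
        # Assumed linear schedule: 0.5 at the junction, tending to 0 away from it.
        weight = 0.5 * (span - j) / span
        left[-1 - j] = [x + weight * d for x, d in zip(left[-1 - j], difference)]
        right[j] = [x - weight * d for x, d in zip(right[j], difference)]
    return left, right


# Hypothetical two-sample periods, constant inside each segment.
SEGMENT_A = [[1.0, 2.0] for _ in range(4)]
SEGMENT_B = [[3.0, 6.0] for _ in range(4)]
LEFT, RIGHT = smooth_junction(SEGMENT_A, SEGMENT_B, 3)
```

After smoothing, the last period of the first segment and the first period of the second segment both equal the midpoint of the two original boundary periods, and the periods further from the junction are modified progressively less.
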

It should be noted that although the efficient coding properties
and smoothing capabilities mentioned above were already available
in the MBR-PSOLA technique as described in the state of the art,
their effect is drastically reinforced in the present invention:
as opposed to the waveforms used by MBR-PSOLA, the periods used
here do not share any of their samples, allowing a perfect
separation between harmonically purified waveforms and waveforms
that are mainly stochastic.

Finally, the present invention still makes it possible to
increase the quality of the synthesized audio signal by
associating, with each resynthesized segment (or 'base segment'),
a set of replacement segments similar but not identical to the
base segment. Each replacement segment is processed in the same
way as the corresponding base segment, and a sequence of periods
is resynthesized. For each replacement segment, for instance, one
can keep two periods corresponding respectively to the beginning
and the end of the replacement segment at synthesis time. When
two segments are about to be concatenated, it is then possible to
modify the periods of the first base segment so as to propagate,
on the last periods of this segment, the difference between the
last period of the base segment and the last period of one of its
replacement segments. Similarly, it is possible to modify the
periods of the second base segment so as to propagate, on the
first periods of this segment, the difference between the first
period of the base segment and the first period of one of its
replacement segments. The propagation of these differences is
simply performed by multiplying the differences by a weighting
coefficient continuously varying from 1 to 0 (from period to
period) and adding the weighted differences to the periods of the
base segments.
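
This propagation can be sketched in Python as follows. The linear decrease of the weighting coefficient, the number of modified periods `span`, and the function name are assumptions made for the illustration; the text only specifies a coefficient varying continuously from 1 to 0.

```python
def pull_towards_replacement(base_periods, replacement_period, span, at_end=True):
    """Propagate, over `span` periods at one end of a base segment, the
    difference between a stored replacement period and the boundary period
    of the base segment.

    The weight is 1 on the boundary period and decreases linearly (assumed
    schedule) so that the period `span` steps away is left untouched.
    """
    periods = [list(period) for period in base_periods]
    boundary = periods[-1] if at_end else periods[0]
    difference = [r - b for r, b in zip(replacement_period, boundary)]
    for j in range(span):
        weight = 1.0 - j / span
        index = -1 - j if at_end else j
        periods[index] = [x + weight * d for x, d in zip(periods[index], difference)]
    return periods


# Hypothetical base segment of five two-sample periods and one replacement period.
BASE = [[0.0, 0.0] for _ in range(5)]
REPLACEMENT = [4.0, -2.0]
ENDING = pull_towards_replacement(BASE, REPLACEMENT, 4, at_end=True)
BEGINNING = pull_towards_replacement(BASE, REPLACEMENT, 4, at_end=False)
```

In `ENDING`, the last period equals the replacement period and the first period is unchanged; `BEGINNING` shows the mirrored case used for the second segment of a concatenation.
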

Such a modification of the time-domain periods of a base segment
so as to make it sound like one of its replacement segments can
be advantageously used to produce free variants to a base sound,
thereby avoiding the monotony resulting from the repeated use of
a base sound. It can also be put to use for the production of
linguistically motivated sound variants (e.g., stressed/unstressed
vowels, tense/soft voice, etc.).

The fundamental difference between the method described in the
state of the art, which according to our classification is a
'PSOLA' method, and the method of the present invention
originates in the particular way of deriving the periods used. As
opposed to the waveforms extracted from a continuous signal as
proposed in the state of the art, the waveforms used in the
present invention do not share any of their samples (hence, they
do not overlap). The method therefore benefits from the typical
advantages of other analytic methods:

- very efficient coding techniques, which account for the fact
  that:

  - periods can be harmonically purified by completely
    eliminating their stochastic component;
  - when resynthesizing periods, the phase of low-frequency
    harmonics can be set constant (i.e., one fixed value for each
    harmonic throughout the segment database);

- the ability to produce sound variants by interpolating between
  base and replacement segments. For each base segment, for
  instance, two additional periods are stored, corresponding to
  the beginning and end of the segment and taken from a
  replacement segment. This enables the synthesis of more
  natural-sounding voices.

Brief description of the drawings 

The method according to the present invention shall 
be more precisely described by comparing it with the 
following state-of-the-art methods: 

Figure 1 illustrates the different steps of speech syn- 
thesis according to a PSOLA method, 

Figure 2 describes the different steps of speech syn- 
thesis according to the method proposed 
by MM. Yazu and Yamada, 

Figure 3 describes the different steps of speech synthesis in
accordance with the present invention.

Description of a preferred embodiment of the 
invention 

Figure 1 shows a classical representation of a PSO- 
LA method characterized by the following steps: 

1. At least on the voiced parts of speech segments, an analysis
is performed by weighting speech with a window approximately
centered on the beginning of each impulse response of the vocal
tract excited by the vocal folds. The weighting window has a
shape that decreases down to zero at its edges, and its length is
at least approximately two times the fundamental period of the
original speech, or two times the fundamental period of the
speech to be synthesized.

2. The signals that result from the weighting opera- 
tion are shifted from each other, the shift being ad- 
justed to the fundamental period of the speech to 
be synthesized, lower or greater than the original 
one, following the prosodic information related to 
the fundamental period at synthesis time. 

3. Synthetic speech is obtained by summing these 
shifted signals. 

Figure 2 shows the method described by MM. Yazu 
and Yamada according to the state of the art which im- 
plements 3 steps: 

1. The original speech is cut out every fixed frame period
(hence, not pitch synchronously), and the spectrum of each frame
is computed by cepstral analysis. Phase components are set to
zero, so that only spectral amplitudes are retained. A symmetric
waveform is then obtained for each initial frame by inverse FFT.
This symmetric waveform is weighted with a fixed-length window
that decreases to almost zero at its borders.

2. The signals that result from the weighting operation are
shifted from each other, the shift being adjusted to the
fundamental period of the speech to be synthesized, lower or
greater than the original one, following the prosodic information
related to the fundamental period at synthesis time.

3. Synthetic speech is obtained by summing these shifted signals.

In this last technique, steps 1 and 2 are often performed once
and for all, which is what distinguishes analytic methods from
those based on a spectral model of the vocal tract. The processed
waveforms are stored in a database that centralizes, in a purely
temporal format, all the information related to the evolution of
the spectral envelope of the speech segments.

Concerning the preferred implementation of the invention herein
described, figure 3 describes the following steps:

1. Analysis frames are assigned a fixed length and shift (denoted
by S). Instead of estimating the spectral envelope of each
analysis frame by cepstral analysis and computing its inverse FFT
(as done by MM. Yazu and Yamada), the analysis algorithm of the
powerful MBE (Multi-Band Excited) model is used, which computes
the frequency, amplitude, and phase of each harmonic of the
analysis frame. The spectral envelope is then derived for each
frame, and the frequencies and amplitudes of the harmonics are
modified without changing this envelope, so as to obtain a fixed
fundamental period equal to the analysis shift, S (i.e., the
spectrum is "re-harmonized" in the frequency domain). Phases of
the lower harmonics are set to a set of fixed values (i.e., a
value chosen once and for all for a given harmonic number).
Time-domain waveforms are then obtained from the harmonics by
computing a sum of sinusoids whose frequencies, amplitudes, and
phases are set equal to those of the harmonics. As opposed to the
invention of MM. Yazu and Yamada, the waveforms are not
symmetrical, as phases have not been set to zero (there was no
other choice in the previous method). Furthermore, the precise
waveforms obtained are not imposed by the algorithm, as they
strongly depend on the fixed phase values imposed before
resynthesis. Instead of storing the complete waveform in a
segment database, only one period of the waveform is kept, since
it is perfectly periodic by construction (sum of harmonics). This
period can be unfolded to obtain the corresponding infinite
waveform as required for the next step.

2. On the voiced parts of speech segments, an analysis is
performed by weighting the aforementioned re-synthesized waveform
(obtained by looping one of its periods computed as a sum of
harmonics) with a window of fixed length. The weighting window
has a shape that decreases down to zero at its edges, and its
length is exactly two times the value of S, and therefore also
two times the fundamental period of the re-synthesized speech
obtained in step 1. One such window is taken from each infinite
waveform derived in step 1.

3. The signals that result from the weighting operation are
overlapped and shifted from each other, the shift being adjusted
to the fundamental period of the speech to be synthesized, lower
or greater than S, following the prosodic information related to
the fundamental period at synthesis time. Synthetic speech is
obtained by summing these shifted signals.
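
The resynthesis of one period described in step 1 can be sketched in Python as follows. The sampling rate, the value of S, the phase table `fixed_phase`, and the analysis amplitudes and phases are all hypothetical; the MBE analysis itself is not reproduced, and the harmonics are simply assumed to be already re-harmonized to the fundamental period S.

```python
import math

SAMPLE_RATE = 16000          # Hz, assumed
S = 160                      # analysis shift and resynthesis period, in samples (assumed)
FIXED_PHASE_LIMIT = 3000.0   # Hz, approximate upper edge of the fixed-phase band


def fixed_phase(harmonic_number):
    """Hypothetical database-wide phase table: one value per harmonic number."""
    return (0.37 * harmonic_number * harmonic_number) % (2 * math.pi)


def resynthesis_phases(analysis_phases):
    """Impose the fixed phases on the lower harmonics and keep the
    analysed phases above FIXED_PHASE_LIMIT."""
    phases = []
    for k, phase in enumerate(analysis_phases, start=1):
        frequency = k * SAMPLE_RATE / S
        phases.append(fixed_phase(k) if frequency <= FIXED_PHASE_LIMIT else phase)
    return phases


def sum_of_sinusoids(amplitudes, phases, start, count):
    """Samples start..start+count-1 of the harmonic sum with period S."""
    return [
        sum(
            a * math.cos(2 * math.pi * k * n / S + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)
        )
        for n in range(start, start + count)
    ]


# Hypothetical analysis result for one frame: 40 harmonics (up to 4 kHz).
ANALYSIS_AMPLITUDES = [1.0 / k for k in range(1, 41)]
ANALYSIS_PHASES = [0.1 * k for k in range(1, 41)]
PHASES = resynthesis_phases(ANALYSIS_PHASES)
STORED_PERIOD = sum_of_sinusoids(ANALYSIS_AMPLITUDES, PHASES, 0, S)
```

Because every sinusoid has a frequency that is a multiple of 1/S, the samples computed from index S onwards repeat `STORED_PERIOD`, which is why storing a single period is sufficient before the windowing and overlap-add of steps 2 and 3.
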

The invention makes it possible to smooth out spectral
discontinuities in the time domain, thanks to the fixed set of
phases applied to the periods during the resynthesis step for
lower-order harmonics, since an interpolation between two such
periods in the time domain is then equivalent to an interpolation
in the frequency domain.
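
This equivalence can be verified with a short Python sketch. The amplitude sets, the phase values and the function names are hypothetical; the example compares a time-domain interpolation of two periods with a period synthesized from interpolated harmonic amplitudes, first with a shared phase set and then with one harmonic in phase opposition.

```python
import math


def period_from(amplitudes, phases, length):
    """One period of a sum of harmonic cosines."""
    return [
        sum(
            a * math.cos(2 * math.pi * k * n / length + p)
            for k, (a, p) in enumerate(zip(amplitudes, phases), start=1)
        )
        for n in range(length)
    ]


def interpolate(first, second, t):
    """Linear interpolation between two equal-length sequences."""
    return [(1.0 - t) * x + t * y for x, y in zip(first, second)]


def harmonic_amplitude(period, k):
    """Amplitude of harmonic k of one period, measured by a DFT bin."""
    length = len(period)
    re = sum(x * math.cos(2 * math.pi * k * n / length) for n, x in enumerate(period))
    im = sum(x * math.sin(2 * math.pi * k * n / length) for n, x in enumerate(period))
    return 2.0 * math.hypot(re, im) / length


AMPS_A = [1.0, 0.5, 0.2]   # hypothetical amplitudes on one side of the junction
AMPS_B = [0.6, 0.7, 0.4]   # hypothetical amplitudes on the other side
FIXED = [0.0, 1.2, -0.7]   # shared phase set (hypothetical values)

MID_TIME = interpolate(
    period_from(AMPS_A, FIXED, 32), period_from(AMPS_B, FIXED, 32), 0.5
)
MID_FREQ = period_from(interpolate(AMPS_A, AMPS_B, 0.5), FIXED, 32)
OPPOSED = interpolate(
    period_from(AMPS_A, FIXED, 32),
    period_from(AMPS_B, [math.pi, 1.2, -0.7], 32),
    0.5,
)
```

`MID_TIME` and `MID_FREQ` coincide, and the first harmonic of the interpolated period has the expected amplitude of 0.8. When the first harmonic of the second period is in phase opposition (`OPPOSED`), the same time-domain interpolation yields an amplitude of only 0.2, which shows why the fixed phase set is needed for the equivalence to hold.
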



Claims

1. Method for audio synthesis from waveforms stored in a
dictionary, characterized in that the following steps are
performed:

- the waveforms are infinite and perfectly periodic, and are
  stored as one of their periods, itself represented as a
  sequence of sound samples of a priori any length;

- a synthesis is carried out by overlapping and adding the
  waveforms multiplied by a weighting window whose length is
  approximately two times the period of the original waveform,
  and whose position relative to the waveform can be set to any
  fixed value;

whereby the time shift between two successive weighted signals
obtained by weighting the original waveforms is equal to the
fundamental period requested for the synthetic signal, whose
value is imposed.

2. Method for audio synthesis according to claim 1, characterized
in that the fundamental period of the synthetic signal is greater
or lower than the original period in the dictionary.

3. Method for audio synthesis according to claim 1 or 2,
characterized in that the lengths of the periods stored in the
dictionary are all identical.

4. Method for audio synthesis according to claim 3, characterized
in that the phases of the lower-frequency harmonics (typically
from 0 to 3 kHz) of the stored periodic waveforms have a fixed
value per harmonic throughout the dictionary.

5. Method for audio synthesis according to any of the preceding
claims, characterized in that the stored waveforms are obtained
from the spectral analysis of a dictionary of audio signal
segments, such as diphones in the case of speech synthesis,
whereby a spectral analysis provides at regular time intervals an
estimate of the instantaneous spectral envelope in each segment
from which the waveforms are computed.

6. Method for audio synthesis according to claim 5, characterized
in that when concatenating two segments, the last periods of the
first segment and the first period of the second segment are
modified to smooth out the time-domain difference measured
between the last period of the first segment and the first period
of the second segment, this time-domain difference being added to
each modified period with a weighting coefficient varying between
-0.5 and 0.5 depending on the position of the modified period
with respect to the concatenation point.

7. Method for audio synthesis according to claim 6, characterized
in that for each base segment, replacement segments are stored,
whereby at synthesis time, when two segments are about to be
concatenated, the periods of the first base segment are modified
so as to propagate, on the last periods of this segment, the
difference between the last period of the base segment and the
last period of one of its replacement segments, and whereby the
periods of the second base segment are modified so as to
propagate, on the first periods of this segment, the difference
between the first period of the base segment and the first period
of one of its replacement segments, the propagation of these
differences being performed by multiplying the measured
differences by a weighting coefficient continuously varying from
1 to 0 (from period to period) and adding the weighted
differences to the periods of the base segments.